AdEase Case Study¶

Gautam Naik (gautamnaik1994@gmail.com)

Github Link: https://github.com/gautamnaik1994/AdEaseTimeSeriesForecastingCaseStudy


About AdEase:
AdEase is a marketing and advertising company that helps businesses maximize clicks at minimal cost. It serves as an advertising infrastructure, enabling businesses to promote themselves effectively, economically, and with ease. Three AI modules (Design, Dispense, and Decipher) work together to form an end-to-end, three-step digital advertising solution accessible to all.

Business Problem
The Data Science team at AdEase aims to comprehend the per-page view report for various Wikipedia pages over a span of 550 days. Their objective is to forecast the number of views to optimize ad placement for clients. The team has access to data from 145,000 Wikipedia pages, including daily view counts for each page. Since AdEase's clients operate in diverse regions, they require insights into how their ads will perform on pages in different languages.

Using the forecasted data, we can identify the best time and language for placing ads to maximize the number of views.

Metrics
The following metric will be used to evaluate model performance:

  • Mean Absolute Percentage Error (MAPE)
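As a quick illustration (not part of the original notebook), MAPE can be computed with scikit-learn's `mean_absolute_percentage_error`, which is also what the imports below bring in. Note that scikit-learn returns a fraction, not a percentage:

```python
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error

# Toy example: actual vs. forecasted daily views
y_true = np.array([100.0, 200.0, 400.0])
y_pred = np.array([110.0, 190.0, 440.0])

# MAPE = mean(|y_true - y_pred| / |y_true|)
score = mean_absolute_percentage_error(y_true, y_pred)
print(score)  # mean(0.10, 0.05, 0.10) ≈ 0.0833
```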

Data
The data is provided as a csv file in wide format, with one row per Wikipedia page:

  • Page: the name of the Wikipedia page (the title together with the language subdomain, access type, and agent, joined by underscores)
  • One column per date (2015-07-01 through 2016-12-31), holding that page's view count for the day

Forecasting Model
The following models will be used to forecast the number of views:

  • ARIMA
  • SARIMAX
  • Prophet
In [77]:
import random
from datetime import timedelta
import warnings

import numpy as np
import pandas as pd
import polars as pl
import duckdb as db
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.tools.sm_exceptions import ConvergenceWarning
from sklearn.metrics import (
    mean_squared_error as mse,
    mean_absolute_error as mae,
    mean_absolute_percentage_error as mape
)
import optuna
from prophet import Prophet

pl.Config(fmt_str_lengths=100)
sns.set_theme(style="whitegrid")
warnings.simplefilter("ignore", UserWarning)
warnings.filterwarnings("ignore", category=UserWarning, module="statsmodels.tsa.base.tsa_model")
warnings.filterwarnings("ignore", category=ConvergenceWarning, module="statsmodels.base.model")
In [3]:
columns = ['Spanish', 'Japanese', 'English', 'Chinese', 'German', 'Russian', 'French']

Data Loading¶

In [135]:
df = pl.read_csv("../data/raw/train_1.csv", has_header=True, ignore_errors=True)
lang_df = pl.read_csv("../data/processed/langcodes.csv", has_header=True, ignore_errors=True)
In [136]:
df
Out[136]:
shape: (145_063, 551)
(Preview of 145,063 rows × 551 columns: Page (str) plus one i64 view-count column per day from 2015-07-01 to 2016-12-31; the rendered wide table is omitted here.)
In [134]:
df.describe()
Out[134]:
shape: (9, 552)
(Summary statistics (count, null_count, mean, std, min, 25%, 50%, 75%, max) for Page and each of the 550 date columns; the rendered wide table is omitted here.)
In [137]:
df=df.unique()

Null Value Check

In [138]:
df.null_count().transpose(include_header=True)
Out[138]:
shape: (551, 2)
column        column_0
str           u32
"Page"        0
"2015-07-01"  20740
"2015-07-02"  20816
"2015-07-03"  20544
"2015-07-04"  20654
…             …
"2016-12-27"  3701
"2016-12-28"  3822
"2016-12-29"  3826
"2016-12-30"  3635
"2016-12-31"  3465

Observations

  • The dataset has a large number of null values at the start of the period, and the null counts decrease toward the end

Data Cleaning¶

In [139]:
df=df.fill_nan(0)
df=df.fill_null(0)
In [140]:
df.select("Page").sample(10, seed=random.randint(0,1000))
Out[140]:
shape: (10, 1)
Page
str
"Jamie_Dornan_de.wikipedia.org_desktop_all-agents"
"Große_Winkelspinne_de.wikipedia.org_all-access_spider"
"兵家綺_zh.wikipedia.org_mobile-web_all-agents"
"Rally_Dakar_es.wikipedia.org_mobile-web_all-agents"
"Дэдпул_ru.wikipedia.org_all-access_spider"
"How_to_contribute/wuu_www.mediawiki.org_desktop_all-agents"
"高梁碧_ja.wikipedia.org_all-access_all-agents"
"魏如昀_zh.wikipedia.org_desktop_all-agents"
"Lion_(2016_film)_en.wikipedia.org_mobile-web_all-agents"
"Massaker_von_Srebrenica_de.wikipedia.org_all-access_all-agents"
In [141]:
db.sql("""
select Page from df where Page like '%_ja.%'
""").pl().sample(5)
Out[141]:
shape: (5, 1)
Page
str
"熊切あさ美_ja.wikipedia.org_mobile-web_all-agents"
"吉田真由子_ja.wikipedia.org_mobile-web_all-agents"
"玉木宏_ja.wikipedia.org_all-access_spider"
"椿隆之_ja.wikipedia.org_all-access_all-agents"
"岡田有希子_ja.wikipedia.org_all-access_spider"
In [142]:
# 'Manual:Installation_guide/en-gb_www.mediawiki.org_mobile-web_all-agents',
extracted_df = df.select(
    pl.col("Page").str.extract(r"/(\w{2})_www\.mediawiki\.org", group_index=1).alias("mediawiki_lang"),
    pl.col("Page").str.extract(r"_(\w{2})\.wikipedia", group_index=1).alias("wikipedia_lang"),
    pl.col("Page").str.extract(r"/(\w{2}-\w{2})_www\.mediawiki\.org", group_index=1).alias("separator_lang"),
)
In [143]:
extracted_df=extracted_df.with_columns(
    pl.concat_str([pl.col("mediawiki_lang"), pl.col("wikipedia_lang"), pl.col("separator_lang")], separator="", ignore_nulls=True).alias("lang")
)
extracted_df
Out[143]:
shape: (145_063, 4)
mediawiki_lang  wikipedia_lang  separator_lang  lang
str             str             str             str
null            "fr"            null            "fr"
null            "ja"            null            "ja"
null            null            null            ""
null            "en"            null            "en"
null            null            null            ""
…               …               …               …
null            "en"            null            "en"
null            null            null            ""
null            "zh"            null            "zh"
null            "es"            null            "es"
null            "de"            null            "de"
In [144]:
extracted_df.group_by("lang").len().sort("len", descending=True).limit(10)
Out[144]:
shape: (10, 2)
lang     len
str      u32
"en"     24120
"ja"     20515
"de"     18692
"fr"     17897
"zh"     17330
""       15775
"ru"     15131
"es"     14172
"ar"     70
"pt-br"  66
In [145]:
df=df.with_columns(
    pl.col("Page").str.split("_").list[-1].alias("Agent"),
    pl.col("Page").str.split("_").list[-2].alias("Access"),
)
In [146]:
df = df.hstack(extracted_df.select("lang"))
In [147]:
df = df.with_columns(
    pl.col("Page").str.replace_all(r"_all-access|_all-agents|_desktop|_spider|_mobile-web", "").alias("Page"),
)
In [148]:
df.select("Page", "lang").filter(pl.col("lang")=="").sample(10, seed=random.randint(0,1000))
Out[148]:
shape: (10, 2)
Page                                                                  lang
str                                                                   str
"File:Decathlon_Logo.png_commons.wikimedia.org"                       ""
"File:CharlesManson2014.jpg_commons.wikimedia.org"                    ""
"Category:Chikki_Panday_commons.wikimedia.org"                        ""
"UploadWizard_www.mediawiki.org"                                      ""
"File:Brickwork_for_wall_construction.jpg_commons.wikimedia.org"      ""
"Topic:T9a0kjggitqfxog0_www.mediawiki.org"                            ""
"Commons:Wiki_Loves_Earth_2016_in_Kosovo_commons.wikimedia.org"       ""
"Category:Doggy_style_in_art_commons.wikimedia.org"                   ""
"Extension:Lucene-search_www.mediawiki.org"                           ""
"Special:MyLanguage/Manual:Configuration_settings_www.mediawiki.org"  ""
In [149]:
df=df.with_columns(
    pl.when(pl.col("lang") == "").then(pl.lit("other")).otherwise(pl.col("lang")).alias("lang")
)
In [150]:
df.group_by("lang").len().sort("len", descending=True).limit(10)
Out[150]:
shape: (10, 2)
lang     len
str      u32
"en"     24120
"ja"     20515
"de"     18692
"fr"     17897
"zh"     17330
"other"  15775
"ru"     15131
"es"     14172
"ar"     70
"pt-br"  66
In [151]:
df = df.with_columns(
    pl.col("lang").count().over("lang").alias("lang_count")
)
In [152]:
df = df.with_columns(
    pl.when(pl.col("lang_count") < 1000).then(pl.lit("other")).otherwise(pl.col("lang")).alias("lang")
)
In [153]:
df=df.drop("lang_count")
In [154]:
df.null_count().transpose(include_header=True)
Out[154]:
shape: (554, 2)
column        column_0
str           u32
"Page"        0
"2015-07-01"  0
"2015-07-02"  0
"2015-07-03"  0
"2015-07-04"  0
…             …
"2016-12-30"  0
"2016-12-31"  0
"Agent"       0
"Access"      0
"lang"        0
In [155]:
# df=df.cast({pl.Int64: pl.Int32})
In [156]:
lang_df
Out[156]:
shape: (284, 2)
code  lang
str   str
"ce"  "Chechen"
"ml"  "Malayalam"
"vi"  "Vietnamese"
"be"  "Belarusian"
"os"  "Ossetian"
…     …
"st"  "Sesotho"
"lt"  "Lithuanian"
"ku"  "Kurdish"
"nd"  "North Ndebele"
"ik"  "Inupiak"
In [157]:
df=db.sql("""
    select df.*,  ifnull(lang_df.lang,'Other') as lan from df left join lang_df on df.lang = lang_df.code
""").pl().drop("lang").rename({"lan": "lang"})
In [158]:
df.write_parquet("../data/cleaned/df.parquet")

Preprocessing the data¶

In [159]:
cleaned_df=pl.read_parquet("../data/cleaned/df.parquet")
In [160]:
cleaned_df.head(10)
Out[160]:
shape: (10, 554)
(First 10 rows of the cleaned frame: Page (str), 550 daily i64 view-count columns, plus Agent, Access, and lang (str); the rendered wide table is omitted here.)
In [161]:
page_view_count = cleaned_df.select(pl.exclude("Page", "lang", "Agent", "Access")).sum_horizontal().to_frame("page_view_count")
cleaned_df = cleaned_df.hstack(page_view_count)
In [162]:
page_view_count
Out[162]:
shape: (145_063, 1)
page_view_count
i64
3192217
2694
167814
190388
6054032
…
1678
26091
11979
818
9236
In [165]:
fig, ax = plt.subplots(1,2, figsize=(15, 6))

cleaned_df.group_by("lang").len().sort("len", descending=True).to_pandas().plot.bar(x="lang", y="len", ax=ax[0])
ax[0].set_title("Number of pages per language");

cleaned_df.group_by("lang").agg(pl.sum("page_view_count")).sort("page_view_count", descending=True).to_pandas().plot.bar(x="lang", y="page_view_count", ax=ax[1])
ax[1].set_title("Total page views per language");

Observations

  • From the above plot, we can see that English has the highest number of pages
  • Consequently, English also has the highest total number of views
In [166]:
fig, ax = plt.subplots(1,2, figsize=(15, 6))

cleaned_df.group_by("Access").agg(pl.sum("page_view_count")).sort("page_view_count", descending=True).to_pandas().plot.bar(x="Access", y="page_view_count", ax=ax[0])
ax[0].set_title("Total page views per Access");

cleaned_df.group_by("Agent").agg(pl.sum("page_view_count")).sort("page_view_count", descending=True).to_pandas().plot.bar(x="Agent", y="page_view_count", ax=ax[1])
ax[1].set_title("Total page views per Agent");

Observations

  • From the above plot, we can see that most views come from the "all-agents" agent type and the "all-access" access type
In [168]:
cleaned_df.group_by("Page").agg(pl.sum("page_view_count")).sort("page_view_count", descending=True).limit(10).to_pandas().plot.barh(x="Page", y="page_view_count")
Out[168]:
<Axes: ylabel='Page'>

Observations

  • Above is the list of the top 10 pages with the highest total view counts
  • The English Wikipedia main page has the highest number of views
In [176]:
db.sql(""" 
       with cte as (
            select 
                Page, 
                lang, 
                sum(page_view_count) as page_view_count, 
                rank() over (partition by lang order by sum(page_view_count) desc) as rank 
                from cleaned_df 
                group by Page, 
                lang 
          )
    select * from cte where rank = 1 order by page_view_count desc
""").pl()
Out[176]:
shape: (8, 4)
Page                                            lang        page_view_count  rank
str                                             str         decimal[38,0]    i64
"Main_Page_en.wikipedia.org"                    "English"   24123683849      1
"Wikipedia:Hauptseite_de.wikipedia.org"         "German"    3167931410       1
"Заглавная_страница_ru.wikipedia.org"           "Russian"   2164257149       1
"Wikipédia:Accueil_principal_fr.wikipedia.org"  "French"    1783233299       1
"Wikipedia:Portada_es.wikipedia.org"            "Spanish"   1513624439       1
"メインページ_ja.wikipedia.org"                 "Japanese"  417501194        1
"Wikipedia:首页_zh.wikipedia.org"               "Chinese"   245697901        1
"Special:Search_commons.wikimedia.org"          "Other"     134968447        1

Observations

  • Above is the top-viewed page in each language (rank 1 per language)
In [177]:
lang_grp = cleaned_df.group_by("lang").sum().drop("Agent", "Access", "page_view_count", "Page")
lang_grp
Out[177]:
shape: (8, 551)
(8 rows × 551 columns: lang plus the summed daily view counts for each date column; the rendered wide table is omitted here.)
In [178]:
df=lang_grp.unpivot(index="lang", variable_name="dates").pivot("lang", index="dates")
df=df.with_columns(
    pl.col("dates").cast(pl.Date),
)
In [179]:
campaign_df = pl.read_csv("../data/processed/Exog_Campaign_eng.csv", has_header=True, ignore_errors=True)
campaign_df
Out[179]:
shape: (550, 1)
Exog
i64
0
0
0
0
0
…
1
1
1
0
0
In [180]:
df=df.hstack(campaign_df)
df.head(10)
Out[180]:
shape: (10, 10)
dates       Spanish   Other    French   English   Chinese  Japanese  Russian   German    Exog
date        i64       i64      i64      i64       i64      i64       i64       i64       i64
2015-07-01  15279998  1477069  8460177  84712323  4147071  11864936  9466267   13264635  0
2015-07-02  14602449  1548788  8514332  84438676  4152793  13622479  9630020   13084303  0
2015-07-03  13428998  1464346  8187395  80167807  4125028  12307000  8925576   12558048  0
2015-07-04  12607803  1249447  8750945  83463351  4164784  15457627  8395377   11523603  0
2015-07-05  13711974  1385036  8591590  86198774  4442807  14828618  8940768   13395805  0
2015-07-06  15627539  1586847  8951425  92809569  4466109  12922630  9631695   14746420  0
2015-07-07  15232641  1679560  8652420  87838163  4461263  12570816  9411165   14301345  0
2015-07-08  14783419  1755740  8493256  82880435  4577333  12494311  9366525   14509892  0
2015-07-09  14504467  1530950  8405132  84799344  4549099  12179762  9594537   13968844  0
2015-07-10  13185904  1547281  7932023  84319604  4729303  12654331  10987503  13018538  0
In [196]:
df = df.with_columns(
    pl.col("dates").dt.strftime("%A").alias("weekday"),
    pl.col("dates").dt.strftime("%B").alias("month"),
    pl.col("dates").dt.year().alias("year"),
)
In [195]:
fig, ax = plt.subplots(4, 2, figsize=(20, 25))

for i, col in enumerate(columns):
    df.group_by("weekday").agg(pl.sum(col)).to_pandas().plot.bar(x="weekday", y=col, ax=ax[i//2, i%2])
    ax[i//2, i%2].set_title(f"{col} page views per weekday")

plt.tight_layout()

Observations

In [197]:
fig, ax = plt.subplots(4, 2, figsize=(20, 25))

for i, col in enumerate(columns):
    df.group_by("month").agg(pl.sum(col)).to_pandas().plot.bar(x="month", y=col, ax=ax[i//2, i%2])
    ax[i//2, i%2].set_title(f"{col} page views per month")

plt.tight_layout()

Observations

  • The above plot shows total page views per month for each language
  • For Spanish, October, September, and November have the highest view counts
In [185]:
df.head(10)
Out[185]:
shape: (10, 13)
[The same 10 rows as above with the derived weekday, month and year columns appended; columns: dates, Spanish, Other, French, English, Chinese, Japanese, Russian, German, Exog, weekday (str), month, year. The numeric cells ran together in the export.]
In [179]:
df.write_parquet("../data/processed/df.parquet")

Time Series Analysis¶

In [84]:
df=pl.read_parquet("../data/processed/df.parquet").to_pandas()
In [85]:
df=df.set_index("dates")
df=df.drop("Other", axis=1)
df
Out[85]:
Spanish Japanese English Chinese German Russian French Exog
dates
2015-07-01 15279998 11864936 84712323 4147071 13264635 9466267 8460177 0
2015-07-02 14602449 13622479 84438676 4152793 13084303 9630020 8514332 0
2015-07-03 13428998 12307000 80167807 4125028 12558048 8925576 8187395 0
2015-07-04 12607803 15457627 83463351 4164784 11523603 8395377 8750945 0
2015-07-05 13711974 14828618 86198774 4442807 13395805 8940768 8591590 0
... ... ... ... ... ... ... ... ...
2016-12-27 15946677 16124459 145628953 6480128 20127904 15041813 15282895 1
2016-12-28 16578967 16152145 141278597 6515403 19155875 14002376 13783146 1
2016-12-29 15648438 17683587 150557723 6044182 18450782 13480581 13401250 1
2016-12-30 11562101 19452390 125404796 6114003 17611564 12069020 12472948 0
2016-12-31 11079339 24461932 123624015 6300124 16566864 13224530 11505867 0

550 rows × 8 columns

In [36]:
columns = ['Spanish', 'Japanese', 'English', 'Chinese', 'German', 'Russian', 'French']
In [37]:
columns
Out[37]:
['Spanish', 'Japanese', 'English', 'Chinese', 'German', 'Russian', 'French']

Combined Time Series Plot

In [86]:
df["Exog_scaled"] = df["Exog"]*df["English"]
filtered_df = df[df["Exog_scaled"] > 0]
df=df.drop("Exog_scaled", axis=1)
In [87]:
df.drop(["Exog"], axis=1).plot(figsize=(20, 8));
plt.scatter(filtered_df.index, filtered_df["Exog_scaled"], color="green", label="Exog_scaled")
plt.title("Page Views by Language");
No description has been provided for this image

Observations

  • We can see that the English language has the highest number of page views.
  • We can also see a surge in views in the month of August for the English and Russian languages.
  • This may be due to the campaigns that were run (campaign days are marked in green)

Individual Time Series Plot

In [41]:
import matplotlib.dates as mdates

fig, ax = plt.subplots(len(columns), 1, figsize=(20, 25))

for i, col in enumerate(columns):
    ax[i].plot(df[col])
    ax[i].set_title(col, fontsize=16)
    ax[i].xaxis.set_major_locator(mdates.MonthLocator())
    ax[i].xaxis.set_major_formatter(mdates.DateFormatter('%b'))
    plt.setp(ax[i].xaxis.get_majorticklabels(), rotation=45, ha='right')

plt.tight_layout()
plt.show()
No description has been provided for this image

Observations

  • Above plot shows the time series for each language
  • We can see that Spanish has double seasonality
  • We can see sudden spikes in the English, French and Russian views
  • This may be due to campaigns that were run

Distribution of views count

In [42]:
fig, ax = plt.subplots(4, 2, figsize=(20, 15))

for i, col in enumerate(columns):
    df[col].hist(ax=ax[i//2, i%2], bins=100)
    ax[i//2, i%2].set_title(col, fontsize=16)
# plt.tight_layout()
No description has been provided for this image

Observations

  • From the above plot we can see that Russian has a large number of outliers

Stationarity Check¶

Augmented Dickey-Fuller (ADF) Test

In [46]:
def is_stationary(series, significance_level=0.05):
    adf_test = adfuller(series)
    p_value = adf_test[1]
    return p_value < significance_level , p_value
In [48]:
adf_df = pd.DataFrame(columns=["Lang", "p-value", "is_stationary"])

for column in columns:
    result, p_value = is_stationary(df[column])
    adf_df = pd.concat([adf_df, pd.DataFrame({"Lang": [column], "p-value": p_value, "is_stationary": result})], ignore_index=True)

adf_df.sort_values("p-value", ascending=False)
C:\Users\v-gautamnaik\AppData\Local\Temp\ipykernel_7352\3790388776.py:5: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.
  adf_df = pd.concat([adf_df, pd.DataFrame({"Lang": [column], "p-value": p_value, "is_stationary": result})], ignore_index=True)
Out[48]:
Lang p-value is_stationary
3 Chinese 0.447261 False
2 English 0.189574 False
4 German 0.139171 False
1 Japanese 0.102580 False
6 French 0.051539 False
0 Spanish 0.033589 True
5 Russian 0.001864 True

Observations

  • Above table shows which language time series are stationary and which are not
  • Only the Russian and Spanish time series are stationary at the 5% significance level

Decomposition¶

Decomposition is a useful tool for understanding the components of a time series. Plotting the decomposition lets us inspect the trend, seasonality and residuals separately. Benefits of decomposition:

  • It helps in understanding the trend and seasonality in the data
  • It helps in understanding the residuals

In short, decomposition helps us understand the data better by separating out the noise and focusing on the trend and seasonality.

English

In [48]:
plt.rcParams['figure.figsize'] = (20, 8)
seasonal_decompose(df["English"], model="additive").plot();
No description has been provided for this image

Observations

  • Trend: There is an upward trend in the page view counts
  • Seasonality: The seasonal plot shows a clear presence of seasonality
  • Residuals: The residuals are spread around 0, suggesting no remaining trend

Spanish

In [50]:
seasonal_decompose(df["Spanish"], model="additive").plot();
No description has been provided for this image

Observations

  • Trend: There is no clear trend in the page view counts
  • Seasonality: The seasonal plot shows a clear presence of seasonality
  • Residuals: The residuals are spread around 0, suggesting no remaining trend

Russian

In [52]:
seasonal_decompose(df["Russian"], model="additive").plot();
No description has been provided for this image

Observations

  • Trend: There is a small trend in the page view counts
  • Seasonality: The seasonal plot shows a clear presence of seasonality
  • Residuals: The residuals are tightly spread around 0, suggesting no remaining trend

Japanese

In [54]:
seasonal_decompose(df["Japanese"], model="additive").plot();
No description has been provided for this image

Observations

  • Trend: There was a jump in views, after which the level remains roughly constant
  • Seasonality: The seasonal plot shows a clear presence of seasonality
  • Residuals: The residuals are evenly spread around 0, suggesting no remaining trend

French

In [56]:
seasonal_decompose(df["French"], model="additive").plot();
No description has been provided for this image

Observations

  • Trend: There is a small trend in the page view counts
  • Seasonality: The seasonal plot shows a clear presence of seasonality
  • Residuals: The residuals are evenly spread around 0, suggesting no remaining trend

German

In [57]:
seasonal_decompose(df["German"], model="additive").plot();
No description has been provided for this image

Observations

  • Trend: There are ups and downs in the page view counts but no clear trend
  • Seasonality: The seasonal plot shows a clear presence of seasonality
  • Residuals: The residuals are evenly spread around 0, suggesting no remaining trend

Chinese

In [58]:
seasonal_decompose(df["Chinese"], model="additive").plot();
No description has been provided for this image

Observations

  • Trend: There was a jump in views, after which the level remains roughly constant
  • Seasonality: The seasonal plot shows a clear presence of seasonality
  • Residuals: The residuals are evenly spread around 0, suggesting no remaining trend

Removing Non-Stationarity¶

In [88]:
df_st = df.copy()

Single Differencing

In [89]:
df_st[columns] = df_st[columns].diff()
In [90]:
df_st=df_st.dropna()
df_st
Out[90]:
Spanish Japanese English Chinese German Russian French Exog
dates
2015-07-02 -677549.0 1757543.0 -273647.0 5722.0 -180332.0 163753.0 54155.0 0
2015-07-03 -1173451.0 -1315479.0 -4270869.0 -27765.0 -526255.0 -704444.0 -326937.0 0
2015-07-04 -821195.0 3150627.0 3295544.0 39756.0 -1034445.0 -530199.0 563550.0 0
2015-07-05 1104171.0 -629009.0 2735423.0 278023.0 1872202.0 545391.0 -159355.0 0
2015-07-06 1915565.0 -1905988.0 6610795.0 23302.0 1350615.0 690927.0 359835.0 0
... ... ... ... ... ... ... ... ...
2016-12-27 -873730.0 205396.0 6516584.0 -7125.0 -2254510.0 -1676938.0 -1396984.0 1
2016-12-28 632290.0 27686.0 -4350356.0 35275.0 -972029.0 -1039437.0 -1499749.0 1
2016-12-29 -930529.0 1531442.0 9279126.0 -471221.0 -705093.0 -521795.0 -381896.0 1
2016-12-30 -4086337.0 1768803.0 -25152927.0 69821.0 -839218.0 -1411561.0 -928302.0 0
2016-12-31 -482762.0 5009542.0 -1780781.0 186121.0 -1044700.0 1155510.0 -967081.0 0

549 rows × 8 columns

In [76]:
adf_df = pd.DataFrame(columns=["Lang", "p-value", "is_stationary"])

for column in columns:
    result, p_value = is_stationary(df_st[column])
    adf_df = pd.concat([adf_df, pd.DataFrame({"Lang": [column], "p-value": p_value, "is_stationary": result})], ignore_index=True)

adf_df.sort_values("p-value", ascending=False)
C:\Users\v-gautamnaik\AppData\Local\Temp\ipykernel_7352\1076179733.py:5: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.
  adf_df = pd.concat([adf_df, pd.DataFrame({"Lang": [column], "p-value": p_value, "is_stationary": result})], ignore_index=True)
Out[76]:
Lang p-value is_stationary
0 Spanish 1.499282e-07 True
4 German 2.488111e-10 True
3 Chinese 1.175377e-11 True
2 English 5.285073e-13 True
6 French 1.056177e-17 True
1 Japanese 6.206743e-20 True
5 Russian 2.485269e-30 True

Observations

  • After applying first-order differencing, all the language series are stationary

ACF and PACF¶

In [25]:
params={}
# params["Spanish"] = {
#     "order": (1, 1, 1), p: AR -> PACF , d: Difference, q: MA -> ACF
#     "seasonal_order": (1, 1, 1, 12), P: Seasonal AR, D: Seasonal Difference, Q: Seasonal MA, m: Seasonal Period
# }
In [4]:
df_st=pd.read_parquet("../data/processed/df_st.parquet")

English

In [77]:
plt.rcParams['figure.figsize'] = (20, 8)
plot_acf(df_st["English"], lags=100, zero=False);
plot_pacf(df_st["English"], lags=50, zero=False);
No description has been provided for this image
No description has been provided for this image

Observations

  • We can see spikes at fixed intervals in the ACF plot, which suggests seasonality
  • We can see a spike at lag 1 in the PACF plot, which suggests an AR(1) model
In [26]:
params["English"] = {
     "order": (1, 1, 5), 
     "seasonal_order": (1, 0, 1, 7) 
    }

Spanish

In [78]:
plot_acf(df_st["Spanish"], lags=100, zero=False);
plot_pacf(df_st["Spanish"], lags=50, zero=False);
No description has been provided for this image
No description has been provided for this image

Observations

  • We can see spikes at fixed intervals in the ACF plot, which suggests seasonality
  • We can see multiple spikes in the PACF plot up to lag 6, which suggests an AR(6) model
In [92]:
params["Spanish"] = {
    "order": (6, 1, 6),
    "seasonal_order": (2, 0, 14, 7),
}

Russian

In [79]:
plot_acf(df_st["Russian"], lags=100, zero=False);
plot_pacf(df_st["Russian"], lags=50, zero=False);
No description has been provided for this image
No description has been provided for this image

Observations

  • We can see that there are only small spikes in the ACF plot
  • We can see spikes up to lag 3 in the PACF plot, which suggests an AR(3) model
In [93]:
params["Russian"] = {
    "order": (3, 1, 2),
    "seasonal_order": (1, 0, 1, 14),
}

Japanese

In [80]:
plot_acf(df_st["Japanese"], lags=100, zero=False);
plot_pacf(df_st["Japanese"], lags=50, zero=False);
No description has been provided for this image
No description has been provided for this image

Observations

  • We can see spikes at fixed intervals in the ACF plot, which suggests seasonality
  • We can see spikes up to lag 2 in the PACF plot, which suggests an AR(2) model
In [94]:
params["Japanese"] = {
    "order": (2, 1, 4),
    "seasonal_order": (1, 0, 15, 7),
}

French

In [81]:
plot_acf(df_st["French"], lags=100, zero=False);
plot_pacf(df_st["French"], lags=50, zero=False);
No description has been provided for this image
No description has been provided for this image

Observations

  • We can see spikes at fixed intervals in the ACF plot, which suggests seasonality
  • We can see spikes up to lag 3 in the PACF plot, which suggests an AR(3) model
In [95]:
params["French"] = {
    "order": (3, 1, 2),
    "seasonal_order": (2, 0, 9, 7),
}

German

In [82]:
plot_acf(df_st["German"], lags=100, zero=False);
plot_pacf(df_st["German"], lags=50, zero=False);
No description has been provided for this image
No description has been provided for this image

Observations

  • We can see spikes at fixed intervals in the ACF plot, which suggests seasonality
  • We can see a spike at lag 1 in the PACF plot, which suggests an AR(1) model
In [96]:
params["German"] = {
    "order": (1, 1, 2),
    "seasonal_order": (4, 0, 14, 7),
}

Chinese

In [83]:
plot_acf(df_st["Chinese"], lags=100, zero=False);
plot_pacf(df_st["Chinese"], lags=50, zero=False);
No description has been provided for this image
No description has been provided for this image

Observations

  • We can see spikes at fixed intervals in the ACF plot, which suggests seasonality
  • We can see a spike at lag 1 in the PACF plot, which suggests an AR(1) model
In [97]:
params["Chinese"] = {
    "order": (1, 1, 5), 
    "seasonal_order": (4, 0, 14, 7), 
}
In [93]:
df_st.to_parquet("../data/processed/df_st.parquet")
df.to_parquet("../data/processed/df_pandas.parquet")
In [98]:
params
# import json
# with open("../data/processed/params.json", "w") as f:
#     json.dump(params, f)
Out[98]:
{'English': {'order': (1, 1, 5), 'seasonal_order': (1, 0, 1, 7)},
 'Spanish': {'order': (6, 1, 6), 'seasonal_order': (2, 0, 14, 7)},
 'Russian': {'order': (3, 1, 2), 'seasonal_order': (1, 0, 1, 14)},
 'Japanese': {'order': (2, 1, 4), 'seasonal_order': (1, 0, 15, 7)},
 'French': {'order': (3, 1, 2), 'seasonal_order': (2, 0, 9, 7)},
 'German': {'order': (1, 1, 2), 'seasonal_order': (4, 0, 14, 7)},
 'Chinese': {'order': (1, 1, 5), 'seasonal_order': (4, 0, 14, 7)}}

Model Building¶

In [3]:
df=pd.read_parquet("../data/processed/df_pandas.parquet")
In [18]:
from sklearn.metrics import mean_absolute_error as mae, mean_squared_error as mse, mean_absolute_percentage_error as mape

def performance(actual, predicted):
    print('MAE :', round(mae(actual, predicted), 3))
    print('RMSE :', round(mse(actual, predicted)**0.5, 3))
    print('MAPE:', round(mape(actual, predicted), 3))
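For reference, the MAPE reported by the `performance` helper is simply the mean of |actual − predicted| / |actual|; a quick check against scikit-learn's implementation (which the helper is assumed to use):

```python
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error as mape

actual = np.array([100.0, 200.0, 400.0])
predicted = np.array([110.0, 180.0, 400.0])

# Manual definition: mean(|actual - predicted| / |actual|)
manual = np.mean(np.abs(actual - predicted) / np.abs(actual))  # (0.1 + 0.1 + 0.0) / 3

print(np.isclose(manual, 0.2 / 3))                  # True
print(np.isclose(mape(actual, predicted), manual))  # True
```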
In [23]:
# time based split
train = df.iloc[:int(0.8*len(df))]
test = df.iloc[int(0.8*len(df)):]

ARIMA¶

This model uses AR and MA terms to forecast the time series. It is defined by the following parameters:

  • p: The number of lag observations included in the model, also called the lag order.
  • d: The number of times that the raw observations are differenced, also called the degree of differencing.
  • q: The size of the moving average window, also called the order of moving average.

However, the ARIMA model cannot capture the seasonality in the data, nor the effect of exogenous variables.

Pipeline to handle all languages

In [99]:
results = []
models={}

for column in columns:
    print(f"Training {column}...")
    model = SARIMAX(train[column], order=params[column]["order"])
    model_fit = model.fit()
    forecast = model_fit.forecast(steps=len(test))
    # models[column] = model_fit
    
    results.append({
        "Language": column,
        'MAE': round(mae(test[column], forecast), 3),
        'RMSE': round(mse(test[column], forecast)**0.5, 3),
        'MAPE': round(mape(test[column], forecast), 3),
        'p': params[column]["order"][0],
        'd': params[column]["order"][1],
        'q': params[column]["order"][2],
    })
    
    # Plotting
    plt.figure(figsize=(20, 8))
    plt.plot(train[column], label="Train")
    plt.plot(test[column], label="Test")
    plt.plot(forecast, label="Forecast")
    plt.title(f"{column} - ARIMA Model")
    plt.legend()
    plt.show()

# Convert results to DataFrame
results_df = pd.DataFrame(results)
Training Spanish...
No description has been provided for this image
Training Japanese...
No description has been provided for this image
Training English...
No description has been provided for this image
Training Chinese...
No description has been provided for this image
Training German...
No description has been provided for this image
Training Russian...
No description has been provided for this image
Training French...
c:\Users\v-gautamnaik\AppData\Local\Programs\Python\Python311\Lib\site-packages\statsmodels\tsa\statespace\sarimax.py:966: UserWarning: Non-stationary starting autoregressive parameters found. Using zeros as starting parameters.
  warn('Non-stationary starting autoregressive parameters'
c:\Users\v-gautamnaik\AppData\Local\Programs\Python\Python311\Lib\site-packages\statsmodels\tsa\statespace\sarimax.py:978: UserWarning: Non-invertible starting MA parameters found. Using zeros as starting parameters.
  warn('Non-invertible starting MA parameters found.'
No description has been provided for this image

Observations

  • The ARIMA model, with parameters read off the ACF and PACF plots, captures the general trend in the data
  • However, it fails to capture the seasonality for some languages

Results

In [42]:
results_df.sort_values("MAPE")
Out[42]:
Language MAE RMSE MAPE p d q
2 English 8306097.860 1.210796e+07 0.066 1 1 5
3 Chinese 438204.793 5.718531e+05 0.066 1 1 5
6 French 1140287.880 1.596890e+06 0.085 3 1 2
1 Japanese 1524784.613 1.889121e+06 0.097 2 1 1
4 German 1752924.870 2.408183e+06 0.098 1 1 2
5 Russian 1657838.177 2.290341e+06 0.107 3 1 2
0 Spanish 2348818.160 3.101108e+06 0.153 6 1 6

Observations

The above table shows the MAPE for each language using the ARIMA model.

SARIMA¶

This model extends ARIMA by also capturing the seasonality in the data. It uses the same parameters as ARIMA, plus the following seasonal parameters:

  • P: Seasonal lag order
  • D: Seasonal difference order
  • Q: Seasonal moving average order
  • S: The number of time steps for a single seasonal period

This is a more advanced model than ARIMA; however, it still cannot capture the effect of exogenous variables.

Pipeline to handle all languages

In [101]:
results = []
models={}
cache = {}

for column in columns:

    print(f"Training {column}...")
    model = SARIMAX(train[column], order=params[column]["order"], seasonal_order=params[column]["seasonal_order"])
    model_fit = model.fit()
    forecast = model_fit.forecast(steps=len(test))
    models[column] = model_fit
    
    results.append({
        "Language": column,
        'MAE': round(mae(test[column], forecast), 3),
        'RMSE': round(mse(test[column], forecast)**0.5, 3),
        'MAPE': round(mape(test[column], forecast), 3),
        'p': params[column]["order"][0],
        'd': params[column]["order"][1],
        'q': params[column]["order"][2],
        'P': params[column]["seasonal_order"][0],
        'D': params[column]["seasonal_order"][1],
        'Q': params[column]["seasonal_order"][2],
        'S': params[column]["seasonal_order"][3],
    })
    
    # Plotting
    plt.figure(figsize=(20, 8))
    plt.plot(train[column], label="Train")
    plt.plot(test[column], label="Test")
    plt.plot(forecast, label="Forecast")
    plt.title(f"{column} - SARIMA Model")
    plt.legend()
    plt.show()

# Convert results to DataFrame
results_df = pd.DataFrame(results)
Training Spanish...
c:\Users\v-gautamnaik\AppData\Local\Programs\Python\Python311\Lib\site-packages\statsmodels\tsa\statespace\sarimax.py:1009: UserWarning: Non-invertible starting seasonal moving average Using zeros as starting parameters.
  warn('Non-invertible starting seasonal moving average'
No description has been provided for this image
Training Japanese...
c:\Users\v-gautamnaik\AppData\Local\Programs\Python\Python311\Lib\site-packages\statsmodels\tsa\statespace\sarimax.py:1009: UserWarning: Non-invertible starting seasonal moving average Using zeros as starting parameters.
  warn('Non-invertible starting seasonal moving average'
No description has been provided for this image
Training English...
No description has been provided for this image
Training Chinese...
No description has been provided for this image
Training German...
No description has been provided for this image
Training Russian...
No description has been provided for this image
Training French...
c:\Users\v-gautamnaik\AppData\Local\Programs\Python\Python311\Lib\site-packages\statsmodels\tsa\statespace\sarimax.py:966: UserWarning: Non-stationary starting autoregressive parameters found. Using zeros as starting parameters.
  warn('Non-stationary starting autoregressive parameters'
c:\Users\v-gautamnaik\AppData\Local\Programs\Python\Python311\Lib\site-packages\statsmodels\tsa\statespace\sarimax.py:978: UserWarning: Non-invertible starting MA parameters found. Using zeros as starting parameters.
  warn('Non-invertible starting MA parameters found.'
No description has been provided for this image

Observations

  • The SARIMA model, built with parameters extracted from the ACF and PACF plots, is able to capture the seasonality in the data
  • With proper tuning of the parameters we should be able to capture the seasonality even better

Results

In [102]:
results_df.sort_values("MAPE")
Out[102]:
Language MAE RMSE MAPE p d q P D Q S
3 Chinese 4.226808e+05 5.364987e+05 0.065 1 1 5 4 0 14 7
6 French 1.247096e+06 1.711687e+06 0.092 3 1 2 2 0 9 7
0 Spanish 1.718605e+06 2.293428e+06 0.097 6 1 6 2 0 14 7
1 Japanese 1.599290e+06 1.961434e+06 0.102 2 1 4 1 0 15 7
2 English 1.214013e+07 1.400092e+07 0.103 1 1 5 1 0 1 7
5 Russian 2.022044e+06 2.622408e+06 0.132 3 1 2 1 0 1 14
4 German 2.456478e+06 3.004588e+06 0.141 1 1 2 4 0 14 7

SARIMAX¶

This model extends SARIMA by also capturing the effect of exogenous variables. It uses the same parameters as SARIMA, plus:

  • exog: The exogenous variable(s) to be included in the model

By modelling the exogenous variables, this model gives a more accurate forecast than the ARIMA and SARIMA models.

In [30]:
params["English"] = { "order": (6, 1, 6), "seasonal_order": (1, 0, 1, 7) }
model = SARIMAX(train["English"], order=params["English"]["order"], exog=train["Exog"], seasonal_order= params["English"]["seasonal_order"] )
model_fit = model.fit()
forecast = model_fit.forecast(steps=len(test), exog=test["Exog"])
performance(test["English"], forecast)
Out[30]:
0.052988798444877507
In [31]:
# Plotting
plt.figure(figsize=(20, 8))
plt.plot(train["English"], label="Train")
plt.plot(test["English"], label="Test")
plt.plot(forecast, label="Forecast")
# plt.xlim(test.index.min()-timedelta(days=50), test.index.max())
plt.title("English - SARIMAX Model")
plt.legend()
plt.show()
No description has been provided for this image

Zoomed in plot for English language prediction using exogenous variable

In [32]:
# Plotting
plt.figure(figsize=(20, 8))
plt.plot(train["English"], label="Train")
plt.plot(test["English"], label="Test")
plt.plot(forecast, label="Forecast")
plt.xlim(test.index.min()-timedelta(days=50), test.index.max())
plt.title("English - SARIMAX Model")
plt.legend()
plt.show()
No description has been provided for this image

Observations

  • We can see that the SARIMAX model has predicted the view counts well, capturing the trend and seasonality along with the effect of the exogenous variable

Hyperparameter Tuning using Optuna¶

In [5]:
def performance(true_values, predictions):
    return mape(true_values, predictions)
In [ ]:
import optuna

def objective(trial):

    p = trial.suggest_int('p', 0, 7)  # AR 
    # d = trial.suggest_int('d', 0, 2)  # Diff
    q = trial.suggest_int('q', 0, 7)  # MA 
    P = trial.suggest_int('P', 0, 2)  # Seasonal AR 
    D = trial.suggest_int('D', 0, 1)  # Seasonal diff
    Q = trial.suggest_int('Q', 0, 2)  # Seasonal MA 
    S = 7 # weekly seasonality

    order = (p, 1, q)
    seasonal_order = (P, D, Q, S)
    
    try:
        model = SARIMAX(train["English"], order=order, exog=train["Exog"], seasonal_order=seasonal_order)
        model_fit = model.fit(disp=False) 
        forecast = model_fit.forecast(steps=len(test), exog=test["Exog"])
        
        error = performance(test["English"], forecast)
    except Exception as e:
        error = np.inf

    return error

storage_name = "sqlite:///sarimax_optuna_study.db"

study = optuna.create_study(direction='minimize', study_name="sarimax_study", storage=storage_name, load_if_exists=True)

study.optimize(objective, n_trials=100)
In [75]:
print("Best hyperparameters: ", study.best_trial.params)
print("Best MAPE: ", study.best_value)
Best hyperparameters:  {'p': 3, 'q': 5, 'P': 0, 'D': 1, 'Q': 1}
Best MAPE:  0.04394953148191757
In [77]:
params["English"] = { "order": (3, 1, 5), "seasonal_order": (0, 1, 1, 7) }
model = SARIMAX(train["English"], order=params["English"]["order"], exog=train["Exog"], seasonal_order= params["English"]["seasonal_order"] )
model_fit = model.fit()
forecast = model_fit.forecast(steps=len(test), exog=test["Exog"])
performance(test["English"], forecast)
Out[77]:
0.04394953148191757
In [78]:
plt.figure(figsize=(20, 8))
plt.plot(train["English"], label="Train")
plt.plot(test["English"], label="Test")
plt.plot(forecast, label="Forecast")
# plt.xlim(test.index.min()-timedelta(days=50), test.index.max())
plt.title("English - SARIMAX Model")
plt.legend()
plt.show()
No description has been provided for this image
In [79]:
# Plotting
plt.figure(figsize=(20, 8))
plt.plot(train["English"], label="Train")
plt.plot(test["English"], label="Test")
plt.plot(forecast, label="Forecast")
plt.xlim(test.index.min()-timedelta(days=50), test.index.max())
plt.title("English - SARIMAX Model")
plt.legend()
plt.show()
No description has been provided for this image

Observations

  • Using Optuna for hyperparameter tuning, the MAPE for the English series decreased from 0.053 to 0.044

Prophet Model¶

Pipeline to handle all languages

In [17]:
from prophet import Prophet

# prophet for all languages
results = []
for column in columns:
    fb_df = pd.DataFrame({
        "ds": df.index,
        "y": df[column],
    }).reset_index(drop=True)

    train = fb_df.iloc[:int(0.8*len(fb_df))]
    test = fb_df.iloc[int(0.8*len(fb_df)):]

    m = Prophet( weekly_seasonality=True )
    m.fit(train)

    future = m.make_future_dataframe(periods=len(test), freq="D")

    forecast = m.predict(future)

    forecast = forecast.set_index("ds")
    train = train.set_index("ds")
    test = test.set_index("ds")

    plt.figure(figsize=(20, 8))
    plt.plot(train["y"], label="Train")
    plt.plot(test["y"], label="Test")
    plt.plot(forecast["yhat"], label="Forecast")
    plt.fill_between(forecast.index, forecast["yhat_lower"], forecast["yhat_upper"], color='violet', alpha=0.2, label="Confidence Interval")
    plt.title(f"{column} - Prophet Model")
    plt.legend()
    plt.show()

    results.append({
        "Language": column,
        'MAPE': performance(test["y"], forecast.loc[test.index, "yhat"]),
    })
02:28:40 - cmdstanpy - INFO - Chain [1] start processing
02:28:40 - cmdstanpy - INFO - Chain [1] done processing
No description has been provided for this image
02:28:41 - cmdstanpy - INFO - Chain [1] start processing
02:28:41 - cmdstanpy - INFO - Chain [1] done processing
No description has been provided for this image
02:28:41 - cmdstanpy - INFO - Chain [1] start processing
02:28:41 - cmdstanpy - INFO - Chain [1] done processing
No description has been provided for this image
02:28:42 - cmdstanpy - INFO - Chain [1] start processing
02:28:42 - cmdstanpy - INFO - Chain [1] done processing
No description has been provided for this image
02:28:42 - cmdstanpy - INFO - Chain [1] start processing
02:28:42 - cmdstanpy - INFO - Chain [1] done processing
No description has been provided for this image
02:28:43 - cmdstanpy - INFO - Chain [1] start processing
02:28:43 - cmdstanpy - INFO - Chain [1] done processing
No description has been provided for this image
02:28:43 - cmdstanpy - INFO - Chain [1] start processing
02:28:43 - cmdstanpy - INFO - Chain [1] done processing
No description has been provided for this image

Results

In [18]:
pd.DataFrame(results).sort_values("MAPE")
Out[18]:
Language MAPE
3 Chinese 0.079152
1 Japanese 0.123545
0 Spanish 0.147672
6 French 0.149335
4 German 0.154803
5 Russian 0.447796
2 English 0.448351

Prophet model using exogenous variable

In [19]:
fb_df_english = pd.DataFrame({
    "ds": df.index,
    "y": df["English"],
    "regressor": df["Exog"]
}).reset_index(drop=True)

train = fb_df_english.iloc[:int(0.8*len(fb_df_english))]
test = fb_df_english.iloc[int(0.8*len(fb_df_english)):]
m = Prophet( weekly_seasonality=True)
m.add_regressor('regressor')
m.fit(train)
future = m.make_future_dataframe(periods=len(test))
future["regressor"] = fb_df_english["regressor"]
forecast = m.predict(future)
forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper','regressor']].tail()
forecast = forecast.set_index("ds")

train = train.set_index("ds")
test = test.set_index("ds")

plt.figure(figsize=(20, 8))
plt.plot(train["y"], label="Train")
plt.plot(test["y"], label="Test")
plt.plot(forecast["yhat"], label="Forecast")
plt.fill_between(forecast.index, forecast["yhat_lower"], forecast["yhat_upper"], color='violet', alpha=0.2, label="Confidence Interval")
plt.title("English - Prophet Model")
plt.legend()
plt.show()
02:49:37 - cmdstanpy - INFO - Chain [1] start processing
02:49:37 - cmdstanpy - INFO - Chain [1] done processing
No description has been provided for this image

Results

In [20]:
performance(test["y"], forecast.loc[test.index, "yhat"])
Out[20]:
0.06769322066744596

Observations

  • Using the exogenous variable, we are better able to capture the trend and seasonality in the data

Insights and Recommendations¶

AdEase can use SARIMAX model to forecast the number of views for different languages. For every language, a new model has to be built. Using the forecasted values, AdEase can optimize the ad placement for their clients.

Stationarity

We used the Augmented Dickey-Fuller test to check the stationarity of the data. Only the Russian and Spanish time series were stationary; we had to apply first-order differencing to all the other languages to make their series stationary.

Page Views

  • Amongst all the languages, English articles had the highest number of views. This may be because English is the most widely used language. Ads can be placed on English articles to get maximum views.
  • The Wikipedia homepage receives the most views of any single page, so AdEase can place ads on the homepage for maximum reach.

Monthly and Weekly Trends

  • For Spanish, October, September and November have the highest number of views
  • For Japanese, September, October, August and December have the highest number of views
  • For English, August and July have the highest number of views
  • For Chinese, December, November and October have the highest number of views
  • For German, December and November have the highest number of views
  • For Russian, August and July have the highest number of views
  • For French, December and November have the highest number of views

The above months can be considered peak months for the respective languages, and ad placement can be optimized during them.

  • In most cases Sundays have the highest number of views

Whenever a campaign was run, the number of views increased significantly. This can be seen in the English time series.